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ABSTRACT 

With the advent of high-throughput sequencing 
technologies, the rapid generation and accumulation 
of large amounts of sequencing data pose an insur- 
mountable demand for efficient algorithms for con- 
structing whole-genome phylogenies. The existing 
phylogenomic methods all use assembled se- 
quences, which are often not available owing to the 
difficulty of assembling short-reads; this obstructs 
phylogenetic investigations on species without a ref- 
erence genome. In this report, we present co-phylog, 
an assembly-free phylogenomic approach that 
creates a 'micro-alignment' at each 'object' in the 
sequence using the 'context' of the object and cal- 
culates pairwise distances before reconstructing the 
phylogenetic tree based on those distances. We 
explored the parameters' usages and the optimal 
working range of co-phylog, assessed co-phylog 
using the simulated next-generation sequencing 
(NGS) data and the real NGS raw data. We also 
compared co-phylog method with traditional align- 
ment and alignment-free methods and illustrated 
the advantages and limitations of co-phylog 
method. In conclusion, we demonstrated that co- 
phylog is efficient algorithm and that it delivers high 
resolution and accurate phylogenies using whole- 
genome unassembled sequencing data, especially 
in the case of closely related organisms, thereby sig- 
nificantly alleviating the computational burden in the 
genomic era. 

INTRODUCTION 

Recent advent of high-throughput sequencing tech- 
nologies enabled the completion of sequencing effort in 
> 1000 species, most of which are prokaryotes. This 
achievement has brought new opportunities to many 



research areas in biological sciences, especially in recon- 
structing the phylogeny of those species. Traditional 
methods in phylogenetic analysis are based on alignment 
of genes or segments. For prokaryotes, the 16S ribosomal 
RNA gene (or 16S rDNA) is the sequence of choice for 
phylogenetic analysis given that it exists in almost all pro- 
karyotic organisms, and it rarely undergoes horizontal 
gene transfer. However, 16S rDNA is highly conserved, 
so that it provides a limited resolution for closely related 
species. This problem could be possibly circumvented by 
selecting less conserved genes, but individual genes may 
reveal inconsistent and sometimes biased phylogenies. 

Given the genomic data that are now available for many 
organisms, several studies have turned to whole-genome 
data to construct phylogenies, and these phylogenomic 
trees typically have much higher resolution than those 
based on a single gene. The methods developed for 
phylogenomic analysis thus far can be classified into 
alignment-based methods (1-3) and alignment-free 
methods (4,5). Alignment-based methods are two-phase 
procedures that first create multiple sequence alignment 
(MSA) among the input sequences and then reconstruct 
the phylogenetic tree based on these MSA. In evolutionary 
biology, MSA has long believed to be a necessary pre- 
requisite for making accurate inferences regarding phyl- 
ogeny, but this viewpoint has recently been increasingly 
questioned (6-8). MSA is a combinatorial optimization 
problem that is known to be NP-hard (9,10). If these 
methods were applied to genomic data from high- 
throughput sequencing, the analysis would be unafford- 
able computationally. 

Alignment-free methods are proposed to bypass the 
computational difficulties arising from MSA. They calcu- 
late the distances between pairwise organisms using 
oligopeptide word usage frequencies (5,11) or information 
measurements, such as Kolmogorov complexity (12,13) 
and Lempel-Ziv complexity (14). The recently proposed 
average common substring approach is based on Kullback- 
Leibler relative entropy (4), and the distance in this 
approach reflects the average length of the maximum 
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common substring of the paired sequences. Composition 
vector tree (CVtree) (5), singular value decomposition (1 1) 
and recent feature frequency profiles (FFPs) methods (15) 
are similar approaches, and all of the approaches are 
based on 'word frequencies'. However, these alignment- 
free phylogenomic methods have their own problems. 
For example, distances measured using information 
theory or word usage frequencies do not typically have a 
clear biological definition and they are rarely linear with 
evolutionary time. 

Next-generation sequencing (NGS) technologies 
provide unprecedented throughput and have resulted in 
the efficient and inexpensive generation of many 
genomes. However, the reads that NGS technologies 
generate are far shorter than those generated by trad- 
itional Sanger sequencing. The assembly of complete 
genomes using NGS is very time-consuming and may be 
impossible when the genome contains a large proportion 
of repetitive segments. To bypass the computational 
difficulties arising from assembly, several assembly-free 
methods have been proposed for comparative genomics 
(16), or identifying single-nucleotide polymorphism 
(SNP) (17,18). However, there is still no method could 
conduct phylogenomic analysis without genome assembly. 

Here, we propose a new phylogenomic approach, 
co-phylog, which is not only as efficient as the existing 
alignment-free approaches but also as accurate as the 
alignment-based methods. Moreover the co-phylog 
method can take advantage of unassembled NGS data 
from complete genomes. In the several genera that we 
have analyzed to date, co-phylog yielded high-resolution 
trees using both complete genome data and NGS data, 
and the trees constructed were highly similar with the 
benchmark trees constructed using traditional alignment- 
based methods. 

This article is organized as follows. The 'Materials and 
Methods' section introduces the 'context-object' model 
and the co-phylog algorithm and describes the methods, 
datasets and benchmarks used for the experiments used to 
assess the algorithm. The 'Results' section reports and 
analyzes the results of the assessment experiments indi- 
vidually and reports the space and time consumption of 
the co-phylog algorithm. The 'Discussion' section elabor- 
ates on the similarities and differences between the 
co-phylog method and the alignment and alignment-free 
methods while emphasizing the advantages and limita- 
tions of the co-phylog method. The 'co-phylog* package 
is available at http://humpopgenfudan.cn/resources/ 
softwares/CO-phylog.tar.gz. 

MATERIALS AND METHODS 

Key concepts in the proposed model 

Let us first briefly review the process of the sequences 
alignment. At the beginning of sequences alignment 
process, all seed matches between the whole query and 
subject sequences are found and then extended into 
longer alignments using dynamic programming. The 
seed match could be an exact match (consecutive seed) 
or an approximate match (spaced seed). Ma, Tromp and 



Li proposed using a 0-1 string to describe a seed model 
where a 1-site represents required match, and O-site is 
'don't care'. For example, if a seed 1110111 is used, then 
'actgact' versus 'acttact' and 'actgact' versus 'actgact' are 
seed matches (19). We can now introduce several new 
concepts used in our context-object model. 

Structure 

A structure S of the seed (or just structure) is the formula 
C a i,a2,..anObi.b2,..im-i, where ttj (i from 1 to n) and b, (i from 
1 to n — 1) are the lengths of the I th consecutive Is segment 
and the z th consecutive Os segment, respectively. For 
example, the seed 1110111 has a structure 5" = Cj^Cv. It 
is clear that the C part of the structure 5 has a length 
L(C S ) = X)t=i a > an d that the O part has a length 
L(O s ) = OOAOJXi 1 ^. The length of the structure (or 
the seed) is L(S) = L(C S ) + L(O s ) = £"=i a t + b t 
(Figure la). 

C-gram and O-gram 

Suppose a structure 5 = Q a i,a2,..an^bi,b2,..bn-i- Let 
w = S]S 2 . ■ -Sk be a Ar-tuple of length k = L(S), and divide 
w into In— 1 parts from left to right with lengths of a lt b;, 
a 2 , b 2 ,-, a n .i, b n .j and a, h respectively. Then the C-gram of 
w, denoted by C^yv), is the concatenation of the first, 
third,., (2«— 7) th parts of w, and the O-gram of w, 
denoted by O^w), is the concatenation of the second, 
fourth,., (In— 2) th parts of w (Figure lc). For example, 
given S = C 3 3 0; and w = actgact, then C s (w) = actact 
and O s (w) = g. 

The k-tuples set 

Given a genome (either assembled or not, denoted by G or 
G', respectively), the A;-tuples set, denoted by H k G or 
H k G ', consists of all the overlapped k-tuples from both 
the genome and its reverse-complement counterparts 
(Figure lb, see Supplementary Data for the formal 
definition). 

Context and object 

Given a structure 5" and a genome G, we have a /c-tuple 
(k = L(S) set Hl(S),g- For an arbitrary C-gram c in the 
genome G, its objects, denoted by object GS (c), are the set 
{Os(w): H'e//i ( 5),c and Cs(w) = c}, which, namely, 
consists of all the O-grams of the L(5)-tuples from G 
whose C-grams are c. The C-gram c is a context if and 
only if the set object c,s( c ) has only one element (Figure 
Id). For example, given 5=CjjO;, suppose genome 
G = '. . . AGGTCCCTGGA . . . AGGGCCCTGGA . . .', 
then the C-gram 'AGGCCC is not a context because it 
has two objects T' and 'G', whereas the C-gram 'CCCGG 
A' is a potential context because it has an unique object 
T'. For convenience, we use the notation C S (G) to denote 
the set of all of the contexts in G, and S(G) to denote the 
set of all of the context-object pairs in G. 

Context-object distance 

Suppose Gi and G 2 are two genomes to be compared; 
given a structure S, we have their context sets C S (G;) 
and C S (G 2 ), respectively. The intersection of the two 
context sets (denoted by R, use \R\ to denote the 
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,0 if Object:, sG = Object C 

\ 1 if Object^C * Object,,:, ,& 
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Distance(A,B) = — — 

R\ 

Figure 1. The algorithm overview, (a) Some examples of structure S. (b) The £-tuple sets H k GI and H kG2 that generated from genome Gj and 
genome G 2 , respectively, given a structure S= C 22 0]. (c) C-gram-O-gram pairs generated from the corresponding /c-tuple sets, (d) Context-object 
pairs generated from the corresponding C-gram-O-gram pairs, (e) Shared Context and their corresponding objects in G] and G 2 . (f) The computing 
of context-object distance between G] and G 2 . 



number of members in R) contains all of the common 
contexts (Figure le). For i from 1 to \R\, let /, = 0 if 
object Gl s (ci) = object G2 ,s(ci), otherwise let I, = 1, where 
Cj is the i member of R. The context-object distance 
(or co-distance) between Gi and G? is given by 

d C0 {G l G 2 )=^^. (1) 

In other words, co-distance is the proportion of shared 
contexts, the two objects of each of which are different in 
their respective genomes (Figure If). 

The algorithm co-phylog and its complexity 

The algorithm co-phylog takes as input N genomes G lt 
G 2 ,., G N , which can be either assembled or not, and the 
( N\ 

outputs are I ^ I pairwise co-distances (Figure lb-f). The 

algorithm is composed of the following two phases: 

(1) Convert the input genomes to their respective sets of 
context-object pairs (Figure lb-d): given a structure 
S, for each input genome G in fasta format 
(assembled genome), we index each O-gram in G 
by its respective C-gram. If different O-grams with 
the same C-gram occur while indexing the genome, 



this C-gram is flagged. After all of the O-grams are 
indexed, the unmarked C-grams and their respective 
O-grams, i.e. the context-object pairs, are output. 
This process is formally expressed by the sub- 
algorithm fasta2co (see Supplementary Data). For 
each genome G' in fastq format (unassembled raw 
data), we need to first filter low-quality L(S)-tuples. 
Let W be an L(5')-tuple on a read of G' . If the lowest 
value of all of the L(S) base qualities of W, denoted 
by min(W), is smaller than a specific threshold, F, 
then the W is discarded. For the L(5)-tuples that 
pass through filtering, the indexing is performed as 
in fastdZco. This process is formally expressed by the 
sub-algorithm fastq2co (see Supplementary Data). 

(2) Compute pairwise co-distances on the sets of context- 
object pairs using Equation (1) (Figure le— f). This 
process is formally expressed by the sub-algorithm 
co2distance (see Supplementary Data). 

(3) Suppose the mean genome size for the organisms 
is M mean and that the mean sequencing depth is d mean 
(the depth of the assembled genome is 1), then, at 
most, phase 1 requires 0(M mean x d mi , em x N) time, 

/N\ 

and phase 2 requires 0(M mecm x I ^ I) time (see 
Doc. SI for the detailed analyses). 
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(4) Once all pairwise co-distances are computed, we use 
the neighbor-joining (NJ) method (20) to construct 
phylogenetic trees. 

The assessment methods, datasets and benchmarks 

The proposed co-phylog algorithm was first assessed using 
only assembled genomes to explore the proper parameters 
and the acceptable working range of the algorithm. The 
algorithm was then assessed on the unassembled whole- 
genome sequencing data, using the phylogenies based on 
the corresponding assembled genomes as benchmarks. 
The full assessment experiments, the corresponding 
datasets (all of the accession numbers for the datasets 
used are provided in Supplementary Table SI) and the 
benchmarks used are introduced below. 

Robustness testing by varying context I object lengths on 
Brucella 13 genomes 

We first assessed if the co-phylog method is robust to dif- 
ferent context and object lengths. For convenience, we 
used the simple structures C a _ a O n with context and 
object lengths that could be adjusted by choosing different 
values for a and n, and we only choose a > 8 for test, 
which allowed the majority [>99%, according to 
Supplementary Equation (S2) in Supplementary Data] of 
the C-grams to be the contexts. The co-phylog trees were 
constructed using seven different structures, 5 = Cg gOj, 
CggOj, CiojqO], Cn.nOh CjsjsOj, C/J/5O2 and 
and took as input the Brucella 13 genomes 
dataset (including 12 complete genomes from the genus 
Brucella and an out-group genome from Ochrohactrum 
anthropi). The resulting trees were then compared with 
the benchmark tree constructed using the same dataset. 

The benchmark tree comes from the work of Foster 
et al., in which they first created all pairwise whole- 
genome alignments using MUMmer and then grouped 
the SNPs by shared locations to compare across all taxa. 
Foster et al. (21) next analyzed the SNPs multiple align- 
ment using the best substitution model as selected by 
ModelTest, and finally constructed a phylogenetic tree 
using the NJ method and verified the tree using different 
methods. 

Tests on the Escherichia! Shigella ^6 genomes 

We next assessed the algorithm using 26 completed 
genomes from the genus Escherichia/ Shigella. The 
accuracy of co-phylog was evaluated based on the symmet- 
ric differences between the co-phylog (where S = Cg gOj) 
tree and the benchmark tree. Two other phylogenomic 
tools, CVtree (http://tlife.fudan.edu.cn/cvtree/) and Kr 
(22) (http://guanine.evolbio.mpg.de/kr2/) were also used 
to build trees, and the trees' accuracies were evaluated in 
the same way. We then made comparisons of the trees' 
accuracies among the different phylogenomic methods. 

The symmetric differences were evaluated using the 
'treedist' program that is contained in the PHYLIP 
package (http://evolution.gs.washington.edu/phylip.html). 
The benchmark tree that was constructed using the same 
dataset from the work of Zhou et al. (23), in which they 
concatenated the alignments of the 2034 core genes of the 



Escherichia! Shigella 26 genomes and used the maximum 
likelihood method to infer the phylogenetic relationships. 

The accuracy of the co-phylog method was also 
evaluated via a correlation analysis between the 
co-distance and the standard /7-distance from whole- 
genomes alignment of the Escherichia! Shigella 26 
genomes. Parallel correlation analysis tasks are also im- 
plemented using the CF?ree-distance and the &-distance 
(as generated by the corresponding tools). 

The benchmark p-distances were generated by an 
in-house Perl script, using the web file 40 way 
Escherichia! Shigella genomes alignment (http://www 
. biotorrents. net/details. php?id = 87), which includes all 
the Escherichia) Shigella 26 genomes. This alignment was 
previously produced by the MSA tool progressiveMauve 
(24) (http ://gel.ahabs.wisc.edu /mauve/) . 

Tests on Enter obacteriaceae 63 genomes and 
Gammaproteobacteria 70 genomes 

We next examined if the co-phylog method was feasible 
when applied to high-level taxonomies. In the first stage 
of the experiment, we tested co-phylog (S = C 9 9 0 /) at the 
family level using 63 genomes randomly picked from 
Enterobacteriaceae. The reconstructed phylogenetic rela- 
tionship based on 16S rDNA sequences alignment is 
used as the benchmark. We then tested co-phylog 
(S = C 99 0 1) on the class level using 70 genomes 
randomly picked from Gammaproteobacteria (we skipped 
the order level because Enterobacteriaceae is the only 
family under the order it belongs to). This co-phylog tree 
was compared with the known taxonomy. 

The 16S rDNA tree was generated as follows. For each 
organism, we first retrieved its 16S rDNA sequence using the 
'Browsers' on the Ribosomal Database Project (http://rdp 
.cme.msu.edu/index.jsp) website, and then created MSA of 
these 16S rDNAs and built a tree using the 'Tree Builder' 
tools (http://rdp.cme.msu.edu/treebuilderpub/index.jsp) (25). 

Explore the acceptable working range of co-phylog using 
in silico evolution 

As a complementary experiment to the performance 
testing on high-level taxonomies, this experiment was 
designed to provide insights into that how far distant the 
two compared genomes are would significantly affect the 
accuracy of the computed co-distance. The artificial life 
framework, ALF (26), which can simulate the entire 
range of evolutionary forces (e.g. substitution, indels, 
gene loss/duplication, GC-content amelioration and 
lateral gene transfer), was adopted to evolve an ancestor 
genome into two descendant genomes with a specified evo- 
lutionary divergence. The co-distances and the common 
context counts between the two evolved genomes were 
then computed. The in silico evolution was repetitively 
implemented with a gradually increased evolutionary di- 
vergence. After the in silico evolution experiments were 
completed, the relationships between the specified evolu- 
tionary divergences and the corresponding co-distances 
and common context counts were analyzed. 

The parameters for ALF simulation were as follows. The 
ancestor genome, Escherichia coli 536 (NC_008253.ffn), 
was evolved into two genomes over 150 runs with an 
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initial substitution rate of 0.01 substitutions per site, and 
each run increased the number of substitutions per site by 
0.01 (the rates of other evolutionary events was increased 
proportionally with the default coefficient defined in the 
ALF parameters file). The substitution models used were 
CPAM' and 'TN93' indels: Zipfian; the variation among 
sites model: rates; the gene number in group later gene 
transfer (gLGT): 10; and the other parameters follow the 
default setting. 
We were then ready to test co-phylog on NGS data. 

Tests on simulated NGS datasets 

We first evaluated that how large of a proportion of the 
genome had to be sequenced to create a faithful tree using 
co-phylog by in silico sequencing on the 'sequencing 
sample genomes'. Supplementary Equation (SI) (see 
Supplementary Data) suggests that the proportion of the 
genome sequenced by perfect in silico sequencing could be 
adjusted through specifying either the mean reads length 
or the number of reads (or sequencing depth). For con- 
venience, we generated five perfect NGS datasets that only 
varied in sequencing depth (depth = 2x, 6x, 16x, 30 x 
and 50 x) using an in-house Perl script and the Brucella 
13 genomes as the 'sequencing sample genomes'. This 
means that each of the five test NGS datasets consists of 
13 unassembled counterparts (G /, G'2,-, G 13) at the same 
depth for the Brucella 13 genomes (G/, G 2 ,., G 13 ). All of 
the reads simulated in the perfect NGS datasets were 
75 bp, error-free and uniformly distributed, which 
allowed us exclude any variation introduced by the 
sequencing experiment itself. The corresponding 
co-phylog (with 5 = C/5 /5O7) tree was constructed using 
each of the five perfect NGS datasets as input, and the 
benchmark tree was constructed using the 'sequencing 
sample genomes' as input. The minimal proportion P of 
the genome that was required by co-phylog was estimated 
by finding the depth at which the tree generated begins to 
be identical to the benchmark tree. 

When the co-phylog method was applied to a real NGS 
dataset, L(S)-tuple with a minimum base quality under the 
threshold F were filtered (see the algorithm section). 
A dilemma in choosing the F value was that too small 
of an F might allow too many L(5)-tuple with 'wrong' 
objects past the filtering and therefore enlarge the devi- 
ation of the co-distance computed, while too large of an 
F might filter too much genomic information. We there- 
fore explored the proper value range of F using simulated 
NGS data with sequencing qualities, which were generated 
using the tool 'Maq simulation' in the MAQ package 
(http://maq.sourceforge.net/). MAQ NGS data (distin- 
guished from the perfect NGS data, using genome B. 
abortus 2308) of different depths were generated and dif- 
ferent F values were tested on these MAQ NGS data; the 
proper range of F values were determined according to 
the co-distance d co (G , G) between the MAQ NGS data 
G and the complete genome G and the proportion q of 
genomic information taken by co-phylog. 

Tests on veal NGS datasets 

Next, we applied co-phylog to the real NGS datasets. By 
retrieving the NCBI Short Reads Archive database, we 



collected 29 Escherichia coli organisms for which the 
NGS raw data and assembled genomes were both avail- 
able (see Supplementary Table SI). A co-phylog tree con- 
structed using the real NGS dataset for the 29 E. coli 
organisms was compared with the tree constructed using 
the respective assembled genomes. We also attempted the 
co-phylog tool on large diploid genomes to see if co-phylog 
is computationally affordable to the large size analyses 
and the additional complication of diploidy. Five mam- 
malian organisms (including four primates, Otolemur 
garnetii, Saimiri boliviensis, Gorilla gorllia and Homo 
sapiens, and a out-group Bos grunniens mutus), all of 
which have abandon NGS data (average sequencing 
depth » 80 x), were used for phylogenomic analysis by 
co-phylog. Then the space and time consumption were 
analyzed. 

RESULTS 

Performance of co-phylog with varied 
context/object lengths 

The comparison shows that the co-phylog tree and the 
benchmark tree are nearly identical (Figure 2), illustrating 
the accuracy of co-phylog method on closely related or- 
ganisms. It also shows, using S = C„ u O„ with varied a and 
n, the co-phylog trees constructed have nearly identical 
shape, suggesting that co-phylog is robust at different 
context/object lengths when applied on closely related or- 
ganisms. However, larger « produces trees with longer 
branch lengths, and this is because co-phylog method 
creates a 'micro-alignment' between two genomes 
compared (see 'Discussion' section) and estimates the 
average nucleotide substitution rate that measured by sub- 
stitutions per n sites, therefore larger n would result in 
higher substitution rate calculated. 

Performance on Escherichia! Shigella 26 genomes 

Co-phylog tree based on the Escherichia /Shigella 26 
genomes shows highly similar topology relative to the 
benchmark tree (symmetry difference = 4). The branch 
lengths are proportional to the benchmark tree but 
shorter (Figure 3 a and b). This result occurs because the 
branch lengths in the co-phylog tree represent the average 
substitution rates of those sites with unchanged flanking 
sequences (namely, 'context') between two compared 
genomes, these sites are generally more conserved than 
the whole-genome average. The most significant difference 
between CVtree and the other methods is that the genus 
Shigella violates the monophyleticity of the genus 
Escherichia but not the monophyleticity of the E. coli 
strains (symmetry difference = 20). Similar results were 
also achieved with the FFPs method (15). Note that the 
CVtree and FFPs distances represent the extent of the 
difference in the 'word frequencies' features of two 
compared genomes, while co-distance and the 
alignment-based distance estimate average nucleotide (or 
amino acid) substitution rate, which is more constant 
across evolutionary history (see 'Discussion' section). 
One possible explanation is that the CVtree and FFPs 
trees represent the taxonomy based on genomic features, 
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Benchmark 



Organisms List 
B. abortus 2308 
B. abortus S19 
B. abortus 9-941 
B- melitensis 16M 
B. melitensis 63 9 
B. melitensis Ether 
B. canis 23365 
B. suis 40 
B. suis 686 
B. suis 1330 

B. suis 23445(Thomsen) 
B. ovis 25840 



C8,sOi 




C9,90l 




I 1 

0.05 substituions/site 



Cio.ioOi 



Cu.uOi 





Cl5,150l 




Cl5,1502 



C15.1504 



H 

0.0002 



Figure 2. Comparisons of the alignment-based tree and the co-phylog trees constructed with different structures, on the Brucella 13 genomes. All the 
trees share the same organisms list. The Ochrobactrum anthropi genome is adopted as the out-group taxon. 



whereas the co-phylog and alignment-based trees represent 
the phylogenetic relationship. Another alignment-free 
method, Kr, was developed to efficiently estimate the 
pairwise distances between genomes and is 'more 
accurate than model-free approaches including the 
average common substring' (22). However, according to 
this test, the Kr tree was still much less accurate than the 
co-phylog tree as measured by both topology and branch 
length (Figure 3a), demonstrating the accuracy of 
co-phylog in establishing the phylogeny of closely related 
organisms. The only inconsistencies between the 
co-phylog tree and the benchmark tree were observed 
on the E. coli CFT073/E. coli 536/E. coli EDla branch. 



We found that the difference could be avoided by deleting 
just E. coli CFT073 (data not shown). It appears that 
the accuracy of co-phylog methods might be slightly 
affected if genomes undergo extensive reorganization 
(such as duplication or recombination) as in the case of 
E. coli CFT073 (27). 

The correlation analysis indicated that the co-distance 
fit well with the p-distance (Figure 3c) and had a 
correlation coefficient of 0.9919. As a comparison, the cor- 
relation coefficients versus the p-distance for the other two 
distances, the CK/ree-distance and the A>-distance are 
0.3464 and 0.7796, respectively. The significant linear 
relationship between the co-distance and the p-distance 
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Figure 3. (a) The benchmark tree constructed based on multiple genomes alignment and the trees constructed by the three methods, co-phylog 
(S = C 9! ,Oy), CVtree and Kr, on the Escherichia! Shigella 26 genomes. The number near the node represents the bootstrap value (see Doc. SI for 
details). And (b) the symmetric differences of the benchmark tree against the trees constructed by the three methods, co-phylog, CVtree and Kr. 
(c) Correlation analyses between the p-distance and each of the three distances, co-distance, CF^ree-distance and Kr-distance. These four types of 
distances are generated from the pairwise comparisons of the Escherichia coli/ Shigella 26 genomes, using multiple genomes alignment, co-phylog, 
CVtree and Kr, respectively. 



are also seen in other closely related organisms based 
on our test data using primate mitochondria DNA 
alignments (data not shown). This linear relation- 
ship explains why the co-phylog tree agrees so well 
with the alignment-based tree and illustrates that the 
co-phylog delivers accurate phylogenies of closely related 
organisms. 



Performance on Entevobacteriaceae and 
Gammaproteobacteria 

The comparisons on the phylogenies of the Enter- 
obacteriaceae 63 genomes show that the 16S rDNA tree 
and the co-phylog tree agreed well in general. However, the 
genera Enterohacter and Citrobacter are polyphyletic 
groups, and the genus Yersinia is a paraphyletic group 
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in the 16S rDNA tree, while all these genera formed a 
single clade in the co-phylog tree, illustrating that 
co-phylog has a much higher resolution than the 16S 
rDNA tree at the family level (Figure 4). However, 
when co-phylog is applied to Gammaproteobacteria 70 
genomes, the accuracy of the constructed tree significantly 
diminishes. Co-phylog tree showed that Enter obacteriales, 
Xanthomonadales, Pasteurellales and Thiotrichales still 
formed a clade, whereas the other orders formed paraphy- 
letic or polyphyletic groups (Supplementary Figure SI). 
The performance on taxonomy levels higher than class 
were also tested but were found to be even worse than 
at the class level (data not show). This exercise demon- 
strates the limits of co-phylog in phylogeny construction. 

The optimal working range of co-phylog as determined 
by in silico evolution 

The in silico evolution experiment showed that the 
co-distance varied significantly starting at the 90th run, 
which corresponds to a divergence of 0.90 substitutions 
per site between the two evolved genomes (Figure 5). 
The major genomic variation introduced in the 90th evo- 
lution was a gLGT event that copied 10 genes from one 



genome to another; the same gLGT event also occurred at 
the 80th, 74th, 65th and several other previous runs. 
However, those occurrences did not significantly affect 
the computed co-distance. The reason is obvious; the se- 
quences 'copied' to the genome by the recent gLGT are 
nearly identical with the original ones on another genome, 
which causes the co-distance between the two genomes to 
be underestimated. When two genomes are closely related, 
there are many common contexts between them; the new 
common contexts that occurred due to the 10 genes gLGT 
only affected the co-distance weakly, but two distant 
genomes have few common contexts (e.g. <5000), and 
the new common contexts that occurred owing to a 10 
genes gLGT (~ 10 000 common contexts) would make up 
the majority of the common contexts, thus significantly 
biasing the computed co-distance. This experiment indi- 
cates that when the common context count > 1 50 000 (cor- 
responding to a co-distance <0.12 or a real divergence of 
<0.8 substitution per site if S = C 9 9 Oj), the two genomes 
being compared are close enough to guarantee the stability 
of the computed co-distance. If using 5" = C/2,/20/, then 
'close enough' genomes should satisfy the criteria of the 
divergence being <0.58 substitution per site and have a 
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Figure 4. Comparison between the 16S rDNA tree and the co-phylog tree, constructed on the Enterobacteriaceae 63 genomes. The number near the 
node represents the bootstrap value (see Supplementary Data for details). 
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Figure 5. The changing of the co-distances and the log number of the common context counts computed between two genome evolved in silico, with 
gradually increased evolutionary divergence (substitutions per codon), using two structures S = C99O; and C12J2O1. 



co-distance <0.09, indicating that using long contexts 
loses more distant homologies. Therefore, when the 
criteria given in Supplementary Equation (S2) (see 
Supplementary Data) are satisfied, it is better to choose 
shorter contexts when applying co-phylog to more distant 
genomes. This estimation may be different when taking 
into account more frequent, recent gLGTs. Fortunately, 
relevant research has suggested that recent gLGTs in two 
distant genomes are rare (about 7 genes) (28), making the 
conclusions from our 10-gene gLGT simulation more 
reliable. 

Performance on simulated NGS datasets 

Through comparing the co-phylog trees constructed using 
the perfect NGS datasets with the benchmark tree 
(Supplementary Figure S2a). We found that when the 
perfect NGS datasets are 'deeper' than 6x, the co-phylog 
tree generated is identical to the benchmark tree. Using 
B. abortus 2308 (one of the Brucella 13 genomes; the other 
12 genomes are of similar lengths) as a representation, the 
L(S)-tuples count of the complete B. abortus 2308 genome 
G is \H KG \ = 6485 644, and the 6x perfect NGS data G of 
the B. abortus 2308 genome generated \H kG >\ = 6 366 349; 
therefore, the proportion q of L(5)-tuples in \H kG \ that 



were included in \H kG \ is 0.98 (see the ' The k-tuples sef 
section for the definitions of \H kG \ and \H kG \), which 
indicates that the minimal proportion, P, of the genome 
required by co-phylog is ~0.98. This value is close to a 
proportion of 0.97 estimated by Supplementary 
Equation (SI), illustrating the accuracy of Supplementary 
Equation (SI) (see Supplementary Data). 

The experiment on the MAQ NGS data G of the 
B. abortus 2308 genome G shows that the smaller the 
qualities filter threshold F is or the 'deeper' the MAQ 
NGS data are, the larger the computed co-distance 
d co (G , G) is (Table 1), as anticipated. Suppose that G 1 
and G 2 are the NGS data generated from the 'sequencing 
sample genomes' Gj and G 2 , respectively. According to the 
triangle inequality, we have d co (G /, G 2 ) < d co {G 1, G 2 ) + 
d co {G 2 , G 2 ) < d co {Gj, G 2 ) + d C0 (Gj, Gj) + d co {G 2 , G 2 ); 
therefore, 

|4o(G/i,G/2Ko(Gi,G2)|4o(Gi,G/iK 0 (G2,G/ 2 ) (2) 

Equation (2) illustrates that the co-distance, d co (G, G), 
determines the extent to which the co-distance computed 
using NGS data deviates from the real co-distance of the 
two compared genomes and therefore determines the limit 
of the resolution of the constructed co-phylog tree. For the 
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Table 1. The co-distances d co (G' , G) and the proportion q% (the bracketed value) of L(S)-tuples taken by co-phylog computed for each depth-i 7 
combination 

F The depths of the MAQ NGS data 





2x 


6x 


16x 


30x 


50 x 


0 


6.5e-05 (70.0) 


l.le-04 (97.1) 


2.8 e-04 (100) 


5.1 e-04 (100) 


8.7 e-04 (100) 


5 


5.3e-06 (57.8) 


l.le-05 (92.4) 


1.6 e-05 (99.9) 


3.0 e-05 (100) 


5.0 e-05 (100) 


10 


3.3e-06 (46.2) 


5.5e-06 (84.2) 


8.4 e-06 (99.3) 


1.4e-05 (100) 


1.8 e-05 (100) 


15 


3.9e-06 (31.4) 


3.2e-06 (67.5) 


3.9 e-06 (95.0) 


4.3 e-06 (99.6) 


9.2 e-06 (100) 


20 


2.6e-06 (11.2) 


2.0e-06 (31.2) 


2.0 e-06 (63.4) 


2.5 e-06 (84.7) 


1.3 e-06 (95.5) 


25 


0 (2.3) 


4.6e-06 (6.7) 


1.8 e-06 (17.3) 


0 (29.6) 


7.0 e-07 (44.3) 


30 


0 (0.02) 


0 (0.04) 


0 (0.11) 


0 (0.21) 


0 (0.33) 


35 


NA (0) 


NA (0) 


NA (0) 


NA (0) 


NA (0) 



NA represents that the d co (G' , G) cannot be computed because there is no L(5)-tuples taken. 



phylogenetic analysis at the genus or species level, d co (G' , 
G) < le-5 ensure sufficiently high resolution and is there- 
fore adopted as a criterion for choosing F. Combining this 
d co (G' , G) criterion and the minimal required genome pro- 
portion previously inferred and considering that the NGS 
data generated from most bacterial sequencing projects 
would be higher than 16x, an F from 10 to 15 should 
be sufficient for practical usage. 

Once the proper lvalue range was determined, we then 
tested the performance of co-phylog with arbitrarily 
selected parameters within the allowed value range 
(5= CisjsOj, F = 10) using the MAQ NGS data that 
are 'deeper' than 16x as input. We considered that, in 
practice, co-phylog will likely deal with NGS data from 
various independent sequencing projects; we therefore 
generated a 'mixed depth' testing dataset in which the 
NGS data of the Brucella 13 genomes were of different 
depths for the different organisms (the depth were 
specified arbitrarily provided that they were all 'deeper' 
than 16x). As we anticipated, the co-phylog tree based 
on this 'mixed depth' testing dataset was identical with 
the benchmark tree (Supplementary Figure S2b). 

Unassembled reads generally contain a significant 
number of sequencing errors, polymerase chain reaction 
(PCR) amplification redundancies and even contamin- 
ations. These effects were also evaluated: the extent of 
their impact was measured by the deviations between the 
benchmark co-distance (computed using assembled 
genomes, S = C 9 9 0i) and the corresponding co-distance 
(with parameters S=C g g 0 1 , F = 10) computed using 
NGS data simulated with different coverages and effects 
(Table 2). This analysis showed that the impacts of 
sequencing errors and PCR amplification redundancies 
on the proposed algorithm are negligible, but the impact 
of contaminations cannot be neglected. 

Performance on real NGS datasets 

The co-phylog (S=C 99 Oi, qualities control threshold 
F = 10) tree constructed using the NGS raw dataset of 
the 29 E. coli organisms is almost identical to the tree 
built using the corresponding assembled genomes 
(Figure 6). Given the accuracy of the co-phylog method 
based on assembled genomes has been proved previously, 
this test illustrates that co-phylog could be used in the 



phylogenetic analysis of unassembled NGS data. And as 
these NGS raw data came from all three popular 
sequencing platforms (454, Illumina and SOLID) 
(Supplementary Table SI), co-phylog is robust to the 
choice of sequencing platforms. The co-phylog tree (using 
5 = C/2,/20;, according to Supplementary Equation (S2) 
and F = 10) constructed on the NGS dataset from the five 
large diploid organisms matched well with the known 
taxonomy (Supplementary Figure S3), illustrating that 
the tool co-phylog can handle large size analyses and the 
complication of diploidy. 

The time and memory consumption of co-phylog 
program (coded in C) were tested on a platform equipped 
with Intel Xeon X5650 2.67 GHz cpu (only one cpu was 
used for this test) and SUSE Linux Enterprise Server 10 
SP2 (x86_64). For real NGS dataset from the E. coli 29, 
which have an average sequencing depth of 95 x , co-phylog 
took 160M memory and 19 min completing the whole 
computing, including 14 min converting all NGS data 
into corresponding context-object sets and 5 min 
computing all pairwise co-distances. For the five mamma- 
lian organisms, co-phylog took 60G memory and 20 h 
completing the whole computing, including 17 h converting 
all NGS data into corresponding context-object sets and 
3 h computing all pairwise co-distances. 

DISCUSSION 

The context-object model is a 'micro-alignment' process 

As we have previously introduced, the traditional se- 
quences alignment method is a 'seed match then extend' 
process. Recall that for computing the co-distance 
between two genomes Gj and G 2 , each member c, in the 
intersection R of the two context sets, C S (G;) and C S (G 2 ), 
is a context match that corresponds to the seed match in a 
spaced seed alignment. Unlike traditional alignment, the 
context-object model does not extend inter-seeds but 
instead extends intra-seeds (namely, the O-parts of the 
structure). Because the O-parts are short (typically 1 or 
a few base pairs), it is reasonable to ignore indels during 
extension. Extension is therefore directly comparing 
O-grams, and the context-object model is an alignment 
process with a span of only ~20~30bp, a so-called 
'micro-alignment' . 
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Table 2. The extents of co-distance biases due to sequencing error, PCR and contamination 



Coverage 




Error rate 


a (%) 




PCR b 


Contamination levels 0 (%) 


0.01 


0.05 


0.1 


1 


1 2 3 


20 


0 


0 


0 


2.6e-05 


1.2e-05 




50 


0 


4e-06 


8e-06 


0.000159 


0 




100 


0 


0 


2e-05 


0.000671 


0 


0.0015 0.0036 0.0053 



"The benchmark co-distance (0.018) was computed between genomes E. coli 536 and E. coli K12, and their NGS data were simulated with different 
coverages and error rates by the tool 'Art' (29). The error rates of popular NGS platforms ranged from 0.01 to 1%, according to (30). 
b NGS data simulated with biased PCR amplification were generated using the tool 'pirs'. which incorporated the coverage GC-content profile trained 
based on real NGS data (31). 

c The simulated scenario was that both NGS data sets (generated from genomes Shigella boydii and Shigella flexneri, with benchmark co-distance 
0.009) were contaminated by the sequences from E. coli K12. Then the co-distance was computed between the two contaminated NGS samples 
(add up to lOOx coverage for each sample). The contamination levels represent the proportions of the E. coli K12 sequences. 
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Figure 6. Comparison between the co-phylog tree constructed using assembled genomes of the E. coli 29 organisms and the co-phylog tree con- 
structed using their corresponding NGS raw data. The Escherichia fergusonii genome is adopted as the out-group taxon. 



There are two main features that make micro-alignment 
much more efficient than traditional alignment. First, a 
context match is created only once between the two 
compared genomes, while a normal 'seed match', which 
is shorter, is created many times in different regions, which 
slows down the calculation (19). Second, once a context 
match is created, as we have elaborated, extensions can be 
implemented by comparing two O-grams directly. As an 
O-gram can be stored in a 'word', an extension requires 
only one operation. In traditional alignment method, the 
seed match is extended through a dynamic programming 
process that requires polynomial operations. 

Micro-alignments do share a problem with the trad- 
itional alignment, namely that the homologies searching 
by using longer seed (or structure S) would lose distant 



homologies (19). This problem is more severe in 
micro-alignments because the C part of structure 5 must 
be long enough to ensure most C-grams from a genome 
can be mapped back to a unique region of the genome 
(therefore, to be the contexts), with increased genetic dif- 
ferences of the involved genomes, the counts of common 
context and phylogenetic information decreased more dra- 
matically, which hindered the application of the proposed 
approach to far distant organisms. 

Comparison against alignment-free methods 

Intuitively, the co-phylog method is somewhat similar to 
several alignment-free methods, especially those word 
frequencies methods. For example: Edgar et al. first 
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compute the normalized common /c-tuples count F(X,Y), 
from the common A>tuples C f XY between two sequences 
X and 7, using the following equation: 

F(X,Y)= Jf 1 ?*?., , (3) 
min(n,m) — k+l 

where w = \{A, C, G, T}| , for DNA sequences compari- 
sons, and n and m are the lengths of X and Y, respectively. 
This was then transformed into a distance, dp{X, Y) = 
—\og{0.\+F) (32). The similar calculation formulas of 
co-distance (Equation (1)) and the dp{X, Y) implies they 
have similar computing efficiency. However, their biolo- 
gical meanings are essentially different. As we have previ- 
ously elaborated, the context-object model is a 
micro-alignment process, which allows co-phylog to only 
call the nucleotide substitution events out of the entire 
range of genome variation. Therefore the co-distance 
computed, according to Equation (1), estimated the 
whole-genome average nucleotide (or amino acid) substi- 
tution rate of those sites with unchanged flanking se- 
quences (namely, 'context') between the two genomes 
compared. The calling of substitution events is critical 
for accurately constructing phylogenetic trees because 
the nucleotide (or amino acid) substitution rate is rela- 
tively constant across evolutionary history according to 
the molecular clock hypothesis. In contrast, the 
normalized common ^-tuples F(X,Y) could be affected 
by a wide range of genome variation events to different 
extents. For example, a gene lost event could decrement 
thousands of common A:-tuples, C t x , while a nucleotide 
substitution event could decrement only a few common 
A>tuples, the changes in min(n,m) — k+l are obviously 
not proportional with that of C { Y , thereby the F(X,Y) 
or df{X, Y) computed do not represent the rate of any 
evolutionary event. There is no unified evolution model 
for all of the types of genomic variations; therefore, the 
alignment-free distance metric, which do not distinguish 
between different types of genome variations cannot be 
accurate. 

In conclusion, the advantages and limitations of 
co-phylog method are obvious, co-phylog has similar 
computing efficiency with 'word frequencies' based on 
alignment-free methods, and in the mean time, it shares 
the accuracy with other alignment-based methods. 
However, co-phylog method does not perform well on 
far distant organisms. 
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